fix(scripts): encode residual JSON control chars as \uXXXX instead of stripping#1872
Conversation
…pping json_escape() was silently deleting control characters (U+0000-U+001F) that were not individually handled (\n, \t, \r, \b, \f). Per RFC 8259, these must be encoded as \uXXXX sequences to preserve data integrity. Replace the tr -d strip with a char-by-char loop that emits proper \uXXXX escapes for any remaining control characters.
|
@mnriem @dhilipkumars, please take a look. |
There was a problem hiding this comment.
Pull request overview
This PR updates the Bash json_escape() fallback (used when jq is unavailable) to avoid silent data loss by encoding previously-unhandled JSON control characters (U+0000–U+001F) as \uXXXX sequences instead of stripping them.
Changes:
- Replaces
tr -dcontrol-character stripping with a per-character scan that emits\u%04xescapes for remaining control bytes. - Keeps existing short escapes (
\n,\t,\r,\b,\f) and quote/backslash escaping behavior unchanged.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
mnriem
left a comment
There was a problem hiding this comment.
Please address Copilot feedback. If not applicable please explain why
- Set LC_ALL=C for the entire loop (not just printf) so that ${#s} and
${s:$i:1} operate on bytes deterministically across locales
- Fix comment: U+0000 (NUL) cannot exist in bash strings, range is
U+0001-U+001F; adjust code guard accordingly (code >= 1)
- Emit directly to stdout instead of accumulating in a variable,
avoiding quadratic string concatenation on longer inputs
|
@mnriem All three Copilot findings have been addressed in 1b7ce3b:
Ready for re-review. |
There was a problem hiding this comment.
Pull request overview
Updates the Bash JSON escaping helper used by various scripts/bash/* utilities to preserve JSON validity by escaping (rather than stripping) remaining ASCII control characters when jq isn’t available.
Changes:
- Replaced control-character stripping with
\uXXXXescaping for remaining U+0001–U+001F bytes injson_escape(). - Added byte-wise iteration under
LC_ALL=Cto keep UTF-8 bytes passing through unchanged.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
mnriem
left a comment
There was a problem hiding this comment.
Please address Copilot feedback. You are almost there!
Replace code=$(printf ...) with printf -v code to assign the character code without spawning a subshell on every byte, reducing overhead for longer inputs.
There was a problem hiding this comment.
Pull request overview
Updates the bash JSON escaping fallback used by various scripts to produce valid JSON when jq isn’t available, by escaping (rather than stripping) remaining control characters.
Changes:
- Replace removal of unescaped U+0001–U+001F control characters with
\uXXXXescaping. - Add a byte-wise iteration step (under
LC_ALL=C) to preserve UTF-8 bytes while escaping control bytes.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Thank you! |
Summary
json_escape()incommon.shwas silently deleting control characters (U+0000–U+001F) not individually handled (\n,\t,\r,\b,\f) viatr -d, causing data loss\uXXXXsequencestr -dstrip with a char-by-char loop that emits proper\uXXXXescapes, preserving data integrityExample
Before:
"hello\x01world"→"helloworld"(SOH silently deleted)After:
"hello\x01world"→"hello\u0001world"(SOH properly escaped)Test plan
json_escapecorrectly encodes control characters like SOH (\x01), STX (\x02), etc. as\uXXXX\n,\t,\r,\b,\f) still use their short escape formscheck-prerequisites.sh --jsonandcreate-new-feature.sh --jsonto verify JSON output is valid